feat(otel): instrument runtime with GenAI semantic conventions #2620
tdabasinskas wants to merge 14 commits into
Conversation
tdabasinskas force-pushed from fa4a01d to 2a69313
@tdabasinskas I'm not sure why, but GitHub doesn't want to merge this one because of hypothetical merge conflicts. Could you rebase?
tdabasinskas force-pushed from 2a69313 to 9b08feb
Done!
tdabasinskas force-pushed from e7194da to b6a181b
/review
I don't think that worked 😅
/review
aheritier
left a comment
LGTM. Clean design, solid thread safety, good spec adherence. The inline comments are all non-blocking suggestions for follow-up.
❌ PR Review Failed — The review agent encountered an error and could not complete the review. View logs.
@tdabasinskas can you rebase one more time and I'll review it? |
Done! |
aheritier
left a comment
Re-approving — my prior approval was dismissed by the merge of upstream/main into the branch, but there are zero new author code changes since a4ce95e8. All three of my previous comments were addressed and the threads are resolved. CI is green on the merge commit.
Original assessment stands: clean design, solid thread safety, good GenAI semconv adherence. LGTM.
I assume you wanted to tag me here :) I gave write access to the repo - feel free to push changes.
Yes, I know the PR is quite big. From the codebase perspective, there's a lot of impact. I did try splitting this into a few bigger commits that should be reviewable independently. I guess I could split the whole PR into smaller PRs if that helps reviewability. Or if you'd prefer to push your own restructuring on top, that works too. From the user perspective, all of this is gated under
Most of the things added here are covered by new unit tests. Happy to add an e2e test that runs an agent end-to-end and asserts on the resulting span tree, if you think that would help. Regarding the PII, correct me if I'm wrong, but I understood that the Docker telemetry (enabled by default) is in no way related to the telemetry exposed by OTel. OTel telemetry requires not only
Hi @dgageot, @aheritier, do you have any updates on how to move forward with this? I see there are already conflicts again — I can resolve them, but since the branch drifts quite quickly, it would be good to know whether we're planning to merge this.
/review |
Needs to be rebased (at a minimum). Moving it to draft.
❌ PR Review Failed — The review agent encountered an error and could not complete the review. View logs.
- `pkg/telemetry/genai/` provides the GenAI semantic-conventions surface: span helpers (`ChatSpan`, `EmbeddingSpan`, `FallbackSpan`, `SandboxSpan`, runtime helpers), attribute / operation-name / provider-name constants per the OTel GenAI semconv, conversation-id baggage round-trippers, error classification, content-capture gating (`OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT`), stability gating (`OTEL_SEMCONV_STABILITY_OPT_IN`), `gen_ai.client.token.usage` and operation-duration histograms, the `gen_ai.evaluation.result` log emitter, and process-boundary helpers (`InjectSandboxEnv`, `InjectTraceContextEnv`)
- `pkg/telemetry/mcp/` provides MCP-specific telemetry: `ConversationIDFromBaggage`, span starters for client / server, `params._meta` propagation carrier, attribute constants, and metrics
- Test files cover content gating, stability defaults, conversation propagation, and span lifecycle invariants
- `cmd/root/otel.go`: stand up `TracerProvider` / `MeterProvider` / `LoggerProvider` from a single `initOTelSDK` entry, configure OTLP/HTTP exporters with explicit-scheme endpoint normalization, set the global W3C trace-context + baggage propagator unconditionally, flush providers in dependency order, attach `service.*` / `host.*` / `process.*` / `os.type` / `host.arch` resource attributes, and use `AlwaysSample` so local agent sessions are not dropped by an upstream sampling decision
- `pkg/httpclient/client.go`: add a `WrapWithOTel` round-tripper gated on a single `atomic.Bool` flipped by `initOTelSDK` (avoids the prior mismatch between `--otel` and the otelhttp wrap), plus `TracedDefaultClient` / `TracedClient` helpers for one-off HTTP calls
- `cmd/root/sandbox.go`: open a host-side `sandbox.exec` span and inject the active W3C trace context as `-e KEY=VALUE` flags so processes inside the container chain onto the host trace
- `cmd/root/new.go`, `cmd/root/otel_test.go`: wire tracer scope and cover the endpoint normalization / localhost detection cases
- `go.mod` / `go.sum`: pull in `go.opentelemetry.io/otel` SDK + OTLP/HTTP exporters
…s and metrics
- `pkg/model/provider/instrument.go`: decorator that wraps any `Provider` with a `chat {model}` CLIENT span (per OTel GenAI semconv), opt-in capture of `gen_ai.input.messages` / `gen_ai.output.messages` / `gen_ai.tool.definitions`, request/response attributes including the Anthropic spec-sum input-token computation (input + cache_read + cache_creation), `gen_ai.client.token.usage` histogram, and `gen_ai.client.operation.duration` histogram. Six wrapper variants preserve the EmbeddingProvider / RerankingProvider capability surfaces so RAG fallbacks round-trip correctly
- `pkg/model/provider/factory.go`, `factory_test.go`: route construction through the decorator
- `pkg/model/provider/anthropic/client.go`, `files.go`: add `anthropic.tokens.count` and `anthropic.files.get_or_upload` spans for the overflow-retry token-counting path and the file-upload cache-or-create path; drop the unnecessary `string(model)` cast
…n, skills, and background agents
- `pkg/runtime/loop.go`: open `runtime.session` and `runtime.stream` INTERNAL spans seeded with `gen_ai.conversation.id` baggage at session start; mark the session span with `error.type=loop_detected` + `codes.Error` when the loop detector terminates
- `pkg/runtime/fallback.go`, `pkg/runtime/cache.go`: wrap the fallback chain with a `runtime.fallback` span carrying primary/final model, attempts, outcome, cooldown state; record provider-cache hit/backing on the cache span
- `pkg/runtime/agent_delegation.go`: emit `runtime.task_transfer` and `runtime.handoff` spans with `gen_ai.operation.name=invoke_agent` and `gen_ai.agent.name`
- `pkg/runtime/skill_runner.go`: emit `invoke_workflow {skill}` per spec
- `pkg/runtime/toolexec/dispatcher.go`: open `runtime.tool.call` and `runtime.tool.handler` spans with the GenAI execute_tool semconv, capture `gen_ai.tool.call.{arguments,result}` under the content-capture opt-in, and stamp `cagent.approval.{decision,source}` from `notifyApproval` so denied / canceled / read-only-allowed calls are distinguishable in trace dashboards
- `pkg/runtime/compactor/compactor.go`: wrap compaction with a span that carries summary tokens and cost
- `pkg/tools/builtin/agent/agent.go`: open a `background_agent.run` root span with a link back to the spawning context, and stamp `gen_ai.conversation.id` from baggage so the span participates in conversation-scoped queries
- `pkg/tools/startable.go`, `pkg/toolinstall/registry.go`: wrap toolset Start with a `toolset.start` span so capability discovery latency is attributable
…race context
- `pkg/hooks/executor.go`: open a single `hook.{event}` INTERNAL span per Dispatch covering every matched hook, then `annotateHookSpan` stamps the aggregated `Result` so denied / asked / allowed / modified-input / summary-provided cases are distinguishable. Verdict booleans and the structured decision/reason are unconditional; free-text `message` / `additional_context` / `system_message` / `summary` are gated on `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT`
- `pkg/hooks/handler.go`: append `genai.InjectTraceContextEnv(ctx)` to the hook subprocess env so script-driven hooks that emit OTel spans (or call instrumented CLIs / LLM endpoints) chain onto the parent `hook.{event}` span instead of producing orphaned roots
- `pkg/mcp/server.go`: route the MCP HTTP transport through `otelhttp.NewHandler` and `otelmcp.StartServer` so inbound requests carry `traceparent` / `baggage` and emit a SERVER span per call
- `pkg/tools/mcp/session_client.go`: wrap MCP client calls (`tools/list`, `tools/call`, `prompts/list`) with CLIENT spans using the params._meta propagation carrier. Iterator wrappers open the span inside the iterator closure (not at call time) so unused iterators do not leak spans, and end on every exit path including early `yield` returns
- `pkg/tools/mcp/oauth.go`, `oauth_helpers.go`, `oauth_login.go`, `oauth_server.go`: wrap interactive OAuth flow and token refresh with `oauth.flow` / `oauth.token.refresh` CLIENT spans, route metadata HTTP calls through `httpclient.TracedClient` / `TracedDefaultClient`, and emit `oauth.step` span events at each network sub-step boundary (`fetch_protected_resource_metadata`, `fetch_authorization_server_metadata`, `dynamic_client_registration`, `request_authorization_code`, `token_exchange`) so a failure can be attributed to a specific stage without descending into HTTP children
…nt semconv
- `pkg/a2a/server.go`: wrap the agent-card and JSON-RPC endpoints with `otelhttp.NewHandler` so inbound A2A requests extract `traceparent` / `tracestate` / `baggage` and emit a SERVER span. The outer `agent-a2a` server wrap covers any auxiliary routes
- `pkg/a2a/adapter.go`: in `runDockerAgent`, decorate the active SERVER span with `gen_ai.operation.name=invoke_agent`, `gen_ai.agent.name`, and `cagent.agent.name`. Wires the runtime tracer scope so per-invocation `runtime.session` / `runtime.stream` / `runtime.tool.call` chain onto the inbound A2A span instead of starting fresh trace ids per request
…ints, and add cold-start spans
- `pkg/server/server.go`: wrap the agent-api Echo handler with `otelhttp.NewHandler` so inbound API requests extract `traceparent` / `tracestate` / `baggage` and the runtime spans started downstream chain onto the calling client trace
- `pkg/server/session_manager.go`: wire the runtime tracer scope into per-session runtime construction; open a `session.runtime_init` INTERNAL span on the cold path (team load + runtime construction) so per-request first-use latency is attributable. Cached hits skip the span — they are a pointer load
- `pkg/chatserver/server.go`, `pkg/chatserver/runtime_pool.go`: wrap the chat completions HTTP server with `otelhttp.NewHandler` and propagate the runtime tracer through the per-session pool
- `pkg/teamloader/teamloader.go`: open a `teamloader.load` INTERNAL span around `LoadWithConfig` so the cold-start path (config parse, model alias resolution, OCI agent pulls, toolset starts) becomes attributable
- `pkg/acp/agent.go`: wire the runtime tracer into the ACP entry point so its sub-spans share scope with CLI / API runs
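The "span only on the cold path" shape can be sketched with a counter standing in for starting `session.runtime_init` (types and names here are illustrative, not the PR's session manager):

```go
package main

import (
	"fmt"
	"sync"
)

// initSpans counts cold-path initializations, standing in for spans.
var initSpans int

type runtime struct{ id string }

type sessionManager struct {
	mu    sync.Mutex
	cache map[string]*runtime
}

// get returns the cached runtime on the hot path (a map load, no span);
// only the cold path wraps team load + construction in the stand-in
// session.runtime_init span.
func (m *sessionManager) get(id string) *runtime {
	m.mu.Lock()
	defer m.mu.Unlock()
	if rt, ok := m.cache[id]; ok {
		return rt // cached hit: no span overhead
	}
	initSpans++ // stand-in for starting session.runtime_init
	rt := &runtime{id: id}
	m.cache[id] = rt
	return rt
}

func main() {
	m := &sessionManager{cache: map[string]*runtime{}}
	m.get("s1")
	m.get("s1") // hit: no new span
	m.get("s2")
	fmt.Println(initSpans) // 2: one per cold start, none for the hit
}
```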
- `pkg/memory/database/sqlite/sqlite.go`: open `memory.{op}` spans on `AddMemory`, `SearchMemories`, etc., with named-return error capture so failures attach to the span via `RecordError`. The search path additionally emits a `retrieval` semconv span for cross-tool dashboards
- `pkg/rag/manager.go`: open `retrieval` (semconv) spans on `Query`, plus `rag.init` / `rag.reindex` / `rag.file_watcher` for lifecycle visibility
- `pkg/sessiontitle/generator.go`: wrap title generation with a `sessiontitle.generate` span; named-return errors fold onto the span on failure
- `pkg/evaluation/judge.go`: emit `gen_ai.evaluation.result` log events from the LLM-as-judge evaluator with score / explanation / error.type, linked to the active span via context for cross-signal join
- `pkg/tools/builtin/shell.go`, `script_shell.go`: stamp `cagent.tool.{shell,script_shell}.{cmd,cwd,timeout_seconds}` on the active `runtime.tool.handler` span. Cmd ships unconditionally because it is the main signal of what the agent did; redact at the OTel collector if commands carry secrets
- `pkg/tools/builtin/filesystem.go`: stamp `cagent.tool.filesystem.{op,path,paths,path_count}` covering all file operations. Paths ship unconditionally for the same incident-response reason
- `pkg/tools/builtin/fetch.go`: stamp `cagent.tool.fetch.{urls,url_count,format}`; each fetched URL still emits its own HTTP CLIENT child span via `httpclient.WrapWithOTel`
- `pkg/tools/builtin/lsp.go`: wrap every tool from `lspTool` so each LSP RPC stamps `cagent.tool.lsp.{tool,read_only}` on the parent span
- `pkg/tools/builtin/lsp_lifecycle.go`: inject `genai.InjectTraceContextEnv(ctx)` into the LSP server spawn env so OTel-aware language servers chain onto the agent trace
- `pkg/tools/builtin/openapi.go`, `pkg/tools/builtin/api.go`: route the user-facing HTTP clients through `httpclient.WrapWithOTel(remote.NewTransport(ctx))` so each API call emits a CLIENT span and propagates `traceparent`
- `pkg/tools/codemode/exec.go`: stamp `cagent.tool.codemode.{script,script_length,tool_call_count}` so a code-mode turn is visible as "ran N lines of JS that called M tools"
…tion
Wrap the HTTP transport chain with `httpclient.WrapWithOTel` so every outbound MCP request injects W3C `traceparent` headers and creates an HTTP CLIENT span. Without this wrap, the streamable-HTTP/SSE transports the gomcp SDK builds send raw POST/GET requests that never chain onto the calling cagent span; the downstream MCP server's spans then live in a separate root trace, breaking end-to-end observability for any agent talking to a remote MCP server. `WrapWithOTel` is a no-op when OTel is disabled at runtime, so the laptop-mode default stays unchanged.
Every toolset goes through tools.WithName in the team-loader
registry, which sandwiches a *tools.namedToolSet between the
StartableToolSet and the actual implementation. %T on the
embedded ToolSet therefore always reported *tools.namedToolSet
regardless of whether the inner toolset was MCP, A2A, a builtin,
or anything else - so the attribute could never answer the
question it exists to answer ("which kind of toolset is starting
right now?").
Unwrap once before formatting, mirroring what DescribeToolSet
already does for the same reason. Now the attribute reads
*mcp.Toolset, *builtin.ShellTool, etc., so a toolset.start
without HTTP children is immediately distinguishable from a
remote MCP whose POSTs are missing for some other reason.
Record tool counts at two key points in the execution flow:
- Session span: total tools available after exclusion filters
- MCP list span: tools successfully yielded by each server
These attributes enable quick analysis of tool availability without inspecting nested spans or JSON-RPC payloads. The MCP count preserves partial results when iteration terminates early.
…errors
Introduce a `classifyByStatusCode` helper that probes for an HTTP status code via a `StatusCode() int` method before falling back to substring matching. This prevents false positives when error messages incidentally contain strings like "401", "403", or "429" in request IDs, byte counts, or status-line fragments. Providers that expose HTTP status codes through a structured interface now get classified from the structural signal, while text-only errors continue to use the existing heuristic. Also add documentation clarifying that `getInstruments` binds to the global MeterProvider on first call via `sync.Once`, which affects test setup requirements.
tdabasinskas force-pushed from d48faf7 to 946df1b
Rebased.
Note: this doesn't happen all the time; I just started a new session and there were no warnings.
Hi @rumpl, The warnings seem to be from Jaeger's clock-skew adjuster, not from docker-agent. They fire when Jaeger queries the trace before all of the span batches have flushed (children get there before their parent). Does re-loading the trace later (after 30s or so) make the warnings disappear? Could you see if enabling more eager flushing via


Adds end-to-end OpenTelemetry instrumentation following the GenAI semantic conventions:
- Chat / embeddings / rerank CLIENT spans with `gen_ai.*` attributes and the `gen_ai.client.token.usage` / `operation.duration` histograms.
- Runtime spans (`runtime.session`, `runtime.stream`, `runtime.fallback`, `runtime.tool.call`, `runtime.run_skill`, `runtime.task_transfer`, `runtime.handoff`, `background_agent.run`).
- MCP client/server telemetry with `params._meta` propagation, plus OAuth flow spans.
- Inbound server endpoints wrapped with `otelhttp` and marked as `invoke_agent`.
- Trace context propagated into the sandbox via `docker exec`.
- Resource attributes (`service.*`, `host.*`, `process.*`, `os.type`)

This PR wires two opt-in env vars beyond the default OTel SDK ones:
- `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT` — capture prompts, responses, tool arguments and tool results as span attributes. Off by default (PII surface).
- `OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental` — emit only the spec-defined `gen_ai.*` keys. Default is dual-emit (both `gen_ai.*` and the legacy `tool.name` / `agent` / `session.id` keys), so existing dashboards keep working alongside spec-aware tooling.

The diff is large — ~50 files, ~5k lines. It's split into 10 topical commits (telemetry primitives → SDK init → providers → runtime → hooks → MCP → A2A → servers/cold-start → memory/RAG → tool internals) so each commit is independently reviewable. Most of the volume is in the new `pkg/telemetry/genai/` and `pkg/telemetry/mcp/` packages, which are pure helpers; the surface-area changes elsewhere are 1-3 lines per call site.